Language Identification in Complex, Unoriented, and Degraded Document Images

Authors

  • Dar-Shyang Lee
  • Craig R. Nohl
  • Henry S. Baird
Abstract

We describe algorithms for identifying the language of text in document images that are complex, unoriented, and degraded. We distinguish among seven languages. The page layouts may be complex, containing text blocks in unknown, roughly Manhattan arrangements. The pages may be unoriented, that is, upright or rotated by 90, 180, or 270 degrees. The images may be degraded by digitization at coarse and unequal spatial sampling rates, as in FAXes. We begin by segmenting the page into text lines in a manner oblivious to page skew and to both page and text-line orientation. Then we distinguish between Asian and Latin scripts at any orientation. For Asian scripts, Chinese versus Japanese is decided at any orientation, and then the orientation is detected. For Latin scripts, we detect first the orientation and then the language. A variety of decision procedures are used, some hand-crafted (e.g. using spatial features and optical density distributions) and others trainable (e.g. using word unigram relative entropy models). Tests on 1088 standard (low) resolution FAX images show that our method accurately identifies scripts (98.16%), and language and page orientations (94.76%).
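The abstract names word unigram relative entropy models as one of the trainable decision procedures but gives no details. The following is a minimal sketch of that general idea, not the authors' implementation: it assumes words are already available as plain strings (the paper works from image features), uses add-one smoothing, and picks the language whose unigram model has the smallest relative entropy from the observed word distribution. The function names, the `<unk>` bucket, and the toy training data are all hypothetical.

```python
import math
from collections import Counter

UNK = "<unk>"  # hypothetical bucket for words outside the shared vocabulary


def train_unigram(words, vocab, alpha=1.0):
    """Return a smoothed unigram distribution over vocab plus the UNK bucket."""
    counts = Counter(w if w in vocab else UNK for w in words)
    total = sum(counts.values()) + alpha * (len(vocab) + 1)
    model = {w: (counts[w] + alpha) / total for w in vocab}
    model[UNK] = (counts[UNK] + alpha) / total
    return model


def relative_entropy(observed_words, model):
    """Relative entropy D(p || q) in bits: p is the empirical word
    distribution of the observed text, q is the language model."""
    mapped = [w if w in model else UNK for w in observed_words]
    counts = Counter(mapped)
    n = len(mapped)
    return sum((c / n) * math.log2((c / n) / model[w]) for w, c in counts.items())


def identify_language(observed_words, models):
    """Pick the language whose model is closest to the observed text."""
    return min(models, key=lambda lang: relative_entropy(observed_words, models[lang]))


# Toy training data (assumption: illustrative only, not from the paper).
training = {
    "english": "the cat and the dog of the house".split(),
    "french": "le chat et le chien de la maison".split(),
    "german": "die katze und der hund von dem haus".split(),
}
vocab = set().union(*training.values())
models = {lang: train_unigram(words, vocab) for lang, words in training.items()}

print(identify_language("the house of the dog".split(), models))   # english
print(identify_language("le chien de la maison".split(), models))  # french
```

Smoothing keeps every model probability nonzero, so the relative entropy stays finite even when the observed text contains words a language model never saw in training.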

Similar resources

Script and Language Identification in Degraded and Distorted Document Images

This paper reports a statistical identification technique that differentiates scripts and languages in degraded and distorted document images. We identify scripts and languages through document vectorization, which transforms each document image into an electronic document vector that characterizes the shape and frequency of the contained character and word images. We first identify scripts bas...

Language Identification in Degraded and Distorted Document Images

This paper presents a language identification technique that differentiates Latin-based languages in degraded and distorted document images. Different from the reported methods that transform word images through a character shape coding process, our method directly captures word shapes with the local extremum points and the horizontal intersection numbers, which are both tolerant of noise, char...

Degraded Script Identification for Indian Language- A Survey

The performance of any optical character recognition (OCR) system depends heavily on the printing and paper quality of the input document image. A number of OCR techniques are available that claim accurate recognition of printed document images in Indian and foreign scripts. Few reports have been found on the recognition of degraded Indian-language documents. The degradation in any scanned printed ...

Font and Function Word Identification in Document Recognition

An algorithm is presented that identifies the predominant font in which the running text in an English language document is printed. Frequent function words (such as the, of, and, a, and to) are also recognized as part of th... [...] font would be used during recognition. This would reduce the confusion caused by training on many fonts and would effectively reduce the recognition problem to choosing the ...

A New Text Mining Method for Extracting User Context Information to Improve Search Engine Result Ranking

Today, the importance of text processing and its uses is well known among researchers and students. The amount of textual material increases day by day, so we need useful ways to store it and retrieve information from it. For example, search engines such as Google, Yahoo, and Bing need to read a great many web documents and retrieve the ones most similar to the user ...

Publication date: 1996